
feat: add NeMo Gym rollouts pipeline#1227

Open
gwarmstrong wants to merge 14 commits into main from georgea/nemo-gym-rollouts-pr

Conversation

@gwarmstrong
Collaborator

@gwarmstrong gwarmstrong commented Feb 10, 2026

NeMo-Gym Rollouts Pipeline

Adds an ns nemo_gym_rollouts CLI command for orchestrating rollout collection with NeMo Gym, supporting self-hosted or pre-hosted vLLM servers, an optional sandbox, and parallel seed-based scaling.

New script classes:

  • NemoGymRolloutsScript: orchestrates ng_run + ng_collect_rollouts
  • MultiVLLMServerScript: data-parallel vLLM with multiple replicas per node (not integrated into pipeline yet)
  • GymClientScript: multi-node vLLM client (not integrated into pipeline yet)

Infrastructure enhancements:

  • Command: mounts, environment, workdir, avoid/force nemo_run_code controls
  • Per-script num_tasks_override for mixed task configurations
  • Allocation ordering fix for multi-component overlap jobs
  • NEMO_SKILLS_FORCE_PATTERN_PACKAGER env var for packaging control
  • --kill-on-bad-exit=1 srun flag for early failure detection
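The per-command controls above (mounts, environment, workdir) can be pictured with a minimal standalone sketch. This is a hypothetical illustration, not the actual nemo_skills Command API: the class name CommandSketch and the render() method are invented here to show how per-command environment and workdir settings could be folded into the launched shell command.

```python
from dataclasses import dataclass, field

# Hypothetical sketch (NOT the real nemo_skills Command class): illustrates
# how per-command environment/workdir settings could compose into the final
# shell command. Mounts are listed but would map to container bind-mounts
# rather than shell text, so they are not rendered here.
@dataclass
class CommandSketch:
    cmd: str
    environment: dict = field(default_factory=dict)
    workdir: str = ""
    mounts: list = field(default_factory=list)

    def render(self) -> str:
        # Prepend env exports and a cd into the workdir, then the command itself.
        parts = [f"export {k}={v}" for k, v in sorted(self.environment.items())]
        if self.workdir:
            parts.append(f"cd {self.workdir}")
        parts.append(self.cmd)
        return " && ".join(parts)

print(CommandSketch(cmd="ng_run", environment={"FOO": "1"}, workdir="/opt/Gym").render())
# export FOO=1 && cd /opt/Gym && ng_run
```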

Gym Config Verification: math_with_judge (AIME24) & code_gen (LCB)

Verified that nemo_gym_rollouts pipeline works end-to-end with math_with_judge and code_gen Gym configs on their full validation datasets.

Results

Model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 (8x GPU, vLLM, --reasoning-parser deepseek_r1)

Metric   AIME24 (30 problems)   LCB (322 problems)
pass@1   0.9000                 0.7259
maj@4    0.9000                 0.6739
pass@4   0.9333                 0.9720

4 random seeds per benchmark, temperature=1.0, top_p=1.0, max_output_tokens=65536.
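For readers unfamiliar with the metrics above, the following sketch shows how pass@1, maj@k, and pass@k relate to per-seed results for a single problem. The record fields ("answer", "is_correct") are assumptions for illustration, not the actual schema emitted by ng_collect_rollouts.

```python
from collections import Counter

# Illustrative per-problem metrics over k seeded generations.
# Field names are hypothetical, not the real rollout schema.
def pass_at_1(seeds):
    # Mean correctness across all seeds (average single-sample accuracy).
    flags = [s["is_correct"] for s in seeds]
    return sum(flags) / len(flags)

def pass_at_k(seeds):
    # Correct if any of the k seeds solved the problem.
    return float(any(s["is_correct"] for s in seeds))

def maj_at_k(seeds):
    # Majority-vote the answers, then score the winning answer.
    top_answer, _ = Counter(s["answer"] for s in seeds).most_common(1)[0]
    return float(any(s["answer"] == top_answer and s["is_correct"] for s in seeds))

seeds = [
    {"answer": "33", "is_correct": True},
    {"answer": "33", "is_correct": True},
    {"answer": "17", "is_correct": False},
    {"answer": "33", "is_correct": True},
]
print(pass_at_1(seeds), maj_at_k(seeds), pass_at_k(seeds))  # 0.75 1.0 1.0
```

The benchmark numbers are these per-problem values averaged over the dataset, which is why pass@4 can far exceed maj@4 (as on LCB) when seeds disagree but at least one is usually right.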

Repro: AIME24 (math_with_judge)

Requires AIME24 validation data pre-downloaded to the cluster (see data download step below).

# run_rollouts_aime24.py
from nemo_skills.pipeline.cli import nemo_gym_rollouts, wrap_arguments

run_number = 2
cluster = "<your-cluster>"

nemo_gym_rollouts(
    ctx=wrap_arguments(
        "+agent_name=math_with_judge_simple_agent "
        "+num_samples_in_parallel=64 "
        "+responses_create_params.max_output_tokens=65536 "
        "+responses_create_params.temperature=1.0 "
        "+responses_create_params.top_p=1.0 "
    ),
    num_random_seeds=4,
    starting_seed=0,
    cluster=cluster,
    config_paths=(
        "resources_servers/math_with_judge/configs/dapo17k.yaml,"
        "responses_api_models/vllm_model/configs/vllm_model.yaml"
    ),
    input_file="resources_servers/math_with_judge/data/aime24_validation.jsonl",
    output_dir=f"/workspace/results/math_verification_aime24/run{run_number}",
    gym_path='/opt/nemo-rl/3rdparty/Gym-workspace/Gym',
    with_sandbox=False,
    model='nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16',
    server_type='vllm',
    server_gpus=8,
    server_args='--trust-remote-code --dtype auto --mamba_ssm_cache_dtype float32 --reasoning-parser deepseek_r1 ',
    partition=None,
    exclusive=True,
)

Repro: LCB (code_gen)

Requires LiveCodeBench v5 validation data pre-downloaded to the cluster (see data download step below).

# run_rollouts_lcb.py
from nemo_skills.pipeline.cli import nemo_gym_rollouts, wrap_arguments

run_number = 2
cluster = "<your-cluster>"

nemo_gym_rollouts(
    ctx=wrap_arguments(
        "+agent_name=code_gen_simple_agent "
        "+num_samples_in_parallel=64 "
        "+responses_create_params.max_output_tokens=65536 "
        "+responses_create_params.temperature=1.0 "
        "+responses_create_params.top_p=1.0 "
    ),
    num_random_seeds=4,
    starting_seed=0,
    cluster=cluster,
    config_paths=(
        "resources_servers/code_gen/configs/code_gen.yaml,"
        "responses_api_models/vllm_model/configs/vllm_model.yaml"
    ),
    input_file="resources_servers/code_gen/data/livecodebench_v5_2024-07-01_2025-02-01_validation.jsonl",
    output_dir=f"/workspace/results/math_verification_lcb/run{run_number}",
    gym_path='/opt/nemo-rl/3rdparty/Gym-workspace/Gym',
    with_sandbox=False,
    model='nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16',
    server_type='vllm',
    server_gpus=8,
    server_args='--trust-remote-code --dtype auto --mamba_ssm_cache_dtype float32 --reasoning-parser deepseek_r1 ',
    partition=None,
    exclusive=True,
)

Data Download

Both datasets must be downloaded to the Gym data directories on the cluster before running rollouts. This uses ng_download_dataset_from_hf inside the nemo-rl container:

from nemo_skills.pipeline.cli import run_cmd, wrap_arguments

gym_path = '/opt/nemo-rl/3rdparty/Gym-workspace/Gym'

run_cmd(
    ctx=wrap_arguments(""),
    cluster="<your-cluster>",
    command=(
        f"cd {gym_path} && "
        "uv venv --python 3.12 --allow-existing .venv && "
        "source .venv/bin/activate && "
        "uv sync --active --extra dev && "
        "ng_download_dataset_from_hf "
        "+repo_id=nvidia/Nemotron-RL-math-OpenMathReasoning "
        "+artifact_fpath=validation.jsonl "
        "+output_fpath=resources_servers/math_with_judge/data/aime24_validation.jsonl && "
        "ng_download_dataset_from_hf "
        "+repo_id=nvidia/nemotron-RL-coding-competitive_coding "
        "+artifact_fpath=validation.jsonl "
        "+output_fpath=resources_servers/code_gen/data/livecodebench_v5_2024-07-01_2025-02-01_validation.jsonl"
    ),
    container="nemo-rl",
    expname="download_gym_data",
    partition="cpu",
    num_gpus=0,
)

Summary by CodeRabbit

  • New Features

    • New NeMo Gym rollout CLI: server/sandbox options, policy handling, seed ranges, per-job isolation, dry-run and parallel execution.
    • Per-command control for mounts, environment, working directory, and Python import behavior.
  • Bug Fixes

    • Slurm jobs now fail fast if any task exits abnormally.
  • Chores

    • Environment flags and docs for deterministic packaging behavior.
  • Refactor

    • Reorganized pipeline script implementations and public script exports.
  • Tests

    • Added dry-run tests for the NeMo Gym rollout pipeline.

Add `ns nemo_gym_rollouts` CLI command for orchestrating rollout collection
with NeMo Gym, including self-hosted or pre-hosted vLLM servers, optional
sandbox, and parallel seed-based scaling.

New script classes:
- NemoGymRolloutsScript: orchestrates ng_run + ng_collect_rollouts
- MultiVLLMServerScript: data-parallel vLLM with multiple replicas per node
- GymClientScript: multi-node vLLM server discovery via SLURM env vars

Infrastructure enhancements:
- Command: mounts, environment, workdir, avoid/force nemo_run_code controls
- Per-script num_tasks_override for mixed task configurations
- Allocation ordering fix for multi-component overlap jobs
- NEMO_SKILLS_FORCE_PATTERN_PACKAGER env var for packaging control
- --kill-on-bad-exit=1 srun flag for early failure detection

Signed-off-by: George Armstrong <georgea@nvidia.com>
@gwarmstrong gwarmstrong force-pushed the georgea/nemo-gym-rollouts-pr branch 2 times, most recently from d61b5f4 to d86e6c9 Compare February 10, 2026 22:02
@gwarmstrong gwarmstrong marked this pull request as ready for review February 13, 2026 18:10
@gwarmstrong gwarmstrong requested a review from Kipok February 13, 2026 18:11
Contributor

@greptile-apps greptile-apps bot left a comment


7 files reviewed, 1 comment


VLLM_SERVER_URL="{vllm_server_url}"
if [ -n "$VLLM_SERVER_URL" ]; then
echo "=== Waiting for vLLM server at $VLLM_SERVER_URL ==="
while [ $(curl -s -o /dev/null -w "%{{http_code}}" "$VLLM_SERVER_URL/models" 2>/dev/null) -ne 200 ]; do

The condition uses curl's HTTP status code directly, but this breaks when curl itself fails (network issues, DNS failures, etc.): the command substitution $(curl ...) can expand to an empty string, leaving the numeric test [ -ne 200 ] with a missing operand and causing a bash error.

Suggested change
while [ $(curl -s -o /dev/null -w "%{{http_code}}" "$VLLM_SERVER_URL/models" 2>/dev/null) -ne 200 ]; do
while ! curl -sf "$VLLM_SERVER_URL/models" >/dev/null 2>&1; do

@coderabbitai
Contributor

coderabbitai bot commented Feb 13, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds a Nemo Gym rollouts CLI and new script classes under a scripts subpackage; expands Command with mounts/environment/workdir and PYTHONPATH controls; introduces deterministic packaging flags, fail-fast Slurm srun, tests, docs, and removes the old monolithic scripts file.

Changes

Cohort / File(s) — Summary

• CLI & entry — nemo_skills/pipeline/cli.py, nemo_skills/pipeline/nemo_gym_rollouts.py
  Adds import/export and registers the nemo_gym_rollouts CLI command implementing orchestration for servers, sandboxes, seeds, Hydra args, sbatch options, dry-run, and pipeline execution.
• New scripts package — nemo_skills/pipeline/utils/scripts/__init__.py, nemo_skills/pipeline/utils/scripts/base.py, .../scripts/generation.py, .../scripts/nemo_gym.py, .../scripts/server.py
  Introduces BaseJobScript, GenerationClientScript, NemoGymRolloutsScript, ServerScript, SandboxScript with command construction, port allocation, env/workdir handling, and per-script overrides.
• Removed legacy scripts — nemo_skills/pipeline/utils/scripts.py
  Deletes the legacy monolithic scripts.py (previous class definitions replaced by the new scripts package).
• Utils re-exports — nemo_skills/pipeline/utils/__init__.py
  Re-exports the new script classes into the utils public namespace.
• Declarative / executor — nemo_skills/pipeline/utils/declarative.py
  Extends Command with mounts, environment, workdir, avoid_nemo_run_code, force_nemo_run_code; injects a PYTHONPATH/workdir prelude, merges envs/mounts, supports num_tasks_override, and adds a job reordering heuristic.
• Packaging & submission — nemo_skills/pipeline/utils/packager.py, nemo_skills/pipeline/utils/exp.py, docs/basics/code-packaging.md
  Adds env flags NEMO_SKILLS_FORCE_INSTALLED_PACKAGE and NEMO_SKILLS_FORCE_PATTERN_PACKAGER with docs; adjusts packager selection; adds --kill-on-bad-exit=1 to Slurm srun; documents packaging overrides.
• Tests — tests/gpu-tests/test_nemo_gym_rollouts.py
  Adds GPU-marked dry-run tests for nemo_gym_rollouts covering single- and multi-seed dry-run scenarios.

Sequence Diagram(s)

sequenceDiagram
    participant CLI as "User CLI\n(nemo_gym_rollouts)"
    participant Orch as "Pipeline Orchestrator\n(Command / Pipeline)"
    participant Server as "vLLM Server\n(ServerScript)"
    participant Sandbox as "Sandbox\n(SandboxScript)"
    participant Client as "Gym Client\n(NemoGymRolloutsScript)"
    participant Storage as "Storage\n(output_dir / logs)"

    CLI->>Orch: invoke nemo_gym_rollouts(args)
    Orch->>Server: create/configure server jobs (ports, GPUs, nodes)
    Orch->>Sandbox: optionally create sandbox per job
    Orch->>Client: create client jobs with server addresses & env
    Server-->>Client: provide server URL / health status
    Client->>Server: run rollouts against vLLM
    Client->>Storage: write collected rollouts and logs
    Orch->>Storage: aggregate statuses and finalize

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

run GPU tests

Suggested reviewers

  • activatedgeek
  • Kipok
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

• Docstring Coverage — ⚠️ Warning: docstring coverage is 72.41%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

• Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
• Title Check — ✅ Passed: the title 'feat: add NeMo Gym rollouts pipeline' accurately and specifically describes the main change, adding a new NeMo Gym rollouts pipeline feature to the codebase.



Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@nemo_skills/pipeline/nemo_gym_rollouts.py`:
- Line 238: The code uses cluster_config["containers"].get(server_type_str,
server_type_str) which can silently accept typos as literal image names; change
this to validate the server_type_str against cluster_config["containers"] (or
use direct access) so missing keys fail fast: check if server_type_str is a key
in cluster_config["containers"] and if so set server_container =
cluster_config["containers"][server_type_str], otherwise raise a clear error (or
fallback to a validated explicit list) so functions referencing server_container
(and the server_type_str lookup) don’t silently accept invalid names.

In `@nemo_skills/pipeline/utils/scripts.py`:
- Around line 930-936: The error block that runs when BEST_COUNT is less than
NUM_NODES prints several echo lines with inconsistent indentation; normalize the
leading spaces so all echo lines align with the surrounding echo and exit
statements. Locate the conditional using BEST_COUNT and NUM_NODES and the echo
statements referencing BEST_VAR_NAME, BEST_NODES_STR, SLURM_JOB_NODELIST,
SLURM_NODELIST, and SLURM_STEP_NODELIST, and remove the extra leading whitespace
on the lines echoing SLURM_* and BEST_NODES_STR so formatting is consistent.
- Around line 966-972: The modulo expression ATTEMPT % (60 /
HEALTH_CHECK_INTERVAL) can divide by zero when HEALTH_CHECK_INTERVAL >= 60; fix
by computing a safe divisor first (e.g. LOG_EVERY = 60 / HEALTH_CHECK_INTERVAL
and if LOG_EVERY < 1 set LOG_EVERY=1) and then use ATTEMPT % LOG_EVERY for the
logging condition; update the block using the variables ATTEMPT,
HEALTH_CHECK_INTERVAL, LOG_EVERY (or similar) and keep the existing ELAPSED and
echo logic unchanged.
- Around line 565-572: The health-check loop uses an unquoted command
substitution which can break if curl returns empty or contains whitespace; fix
the loop in the VLLM server wait logic by capturing the curl result into a
quoted variable and comparing it safely, e.g. status="$(curl -s -o /dev/null -w
"%{http_code}" "$VLLM_SERVER_URL/models" 2>/dev/null)" and then use a quoted
comparison like while [ "$status" != "200" ]; do sleep 10; status="$(...)" done;
ensure the variable and substitutions (VLLM_SERVER_URL and the curl command) are
wrapped in double quotes to be defensive.
🧹 Nitpick comments (5)
nemo_skills/pipeline/utils/packager.py (1)

139-157: Mutually reinforcing flags — consider documenting interaction.

When force_pattern_packager=True, repo_path is None so we skip the if repo_path: branch entirely (going to the else at line 164). When force_installed_nemo_skills=True but force_pattern_packager=False, we enter the git repo branch but include the full installed package. These two flags interact subtly — both are documented in the comments above, which is sufficient, but worth noting that force_installed_nemo_skills has no effect when force_pattern_packager is also set (since the code path that checks it is skipped).

nemo_skills/pipeline/utils/declarative.py (1)

263-294: PYTHONPATH filtering only removes exact /nemo_run/code — intentional?

Line 280 uses grep -v '^/nemo_run/code$' which only removes the exact path /nemo_run/code from PYTHONPATH. If PYTHONPATH contains /nemo_run/code/ (trailing slash) or /nemo_run/code/subdir, those entries would be preserved. This is likely intentional (precise removal), but worth confirming it matches the expected behavior. Subdirectory entries like /nemo_run/code/nemo_skills would also remain.

nemo_skills/pipeline/nemo_gym_rollouts.py (2)

184-206: Validation logic is thorough and covers key mutual-exclusion cases.

The self-hosted vs. pre-hosted validation gates are well-structured. One edge case: if a user provides --server_type with --server_address (pre-hosted), the server_type is silently unused. Consider warning or erroring when --server_type is provided alongside --server_address.


247-255: Broad exception catch is acceptable here but could be narrower.

The except Exception catches all failures when checking remote file existence. Since this involves SSH/cluster operations that can fail unpredictably, the defensive approach with a warning log and safe fallback (include the seed) is pragmatic. If you want to tighten it, catching (OSError, RuntimeError) would cover network/SSH failures while still letting truly unexpected errors surface.

nemo_skills/pipeline/utils/scripts.py (1)

690-691: Remove unused cluster_config field or document its intended purpose.

The cluster_config: Dict parameter is declared but never used anywhere in MultiVLLMServerScript—not in __post_init__ or elsewhere in the class. Unlike ServerScript and SandboxScript, which both pass this to their respective command builders, MultiVLLMServerScript builds its bash script using only other fields (model_path, num_nodes, gpus_per_server, etc.). Per coding guidelines, avoid silently ignoring user-passed parameters; either remove the field or add clear documentation if it's reserved for future use.

Comment on lines +565 to +572
# Note: --kill-on-bad-exit in srun ensures job fails if vLLM crashes
VLLM_SERVER_URL="{vllm_server_url}"
if [ -n "$VLLM_SERVER_URL" ]; then
echo "=== Waiting for vLLM server at $VLLM_SERVER_URL ==="
while [ $(curl -s -o /dev/null -w "%{{http_code}}" "$VLLM_SERVER_URL/models" 2>/dev/null) -ne 200 ]; do
sleep 10
done
echo "vLLM server is ready!"

⚠️ Potential issue | 🟡 Minor

Unquoted command substitution in health-check loop.

Line 569: The $(curl ...) is not double-quoted. While curl -w "%{http_code}" normally returns "000" on connection failure, quoting defensively is safer.

🛡️ Suggested fix
-    while [ $(curl -s -o /dev/null -w "%{{http_code}}" "$VLLM_SERVER_URL/models" 2>/dev/null) -ne 200 ]; do
+    while [ "$(curl -s -o /dev/null -w "%{{http_code}}" "$VLLM_SERVER_URL/models" 2>/dev/null)" -ne 200 ]; do
🤖 Prompt for AI Agents
In `@nemo_skills/pipeline/utils/scripts.py` around lines 565 - 572, The
health-check loop uses an unquoted command substitution which can break if curl
returns empty or contains whitespace; fix the loop in the VLLM server wait logic
by capturing the curl result into a quoted variable and comparing it safely,
e.g. status="$(curl -s -o /dev/null -w "%{http_code}" "$VLLM_SERVER_URL/models"
2>/dev/null)" and then use a quoted comparison like while [ "$status" != "200"
]; do sleep 10; status="$(...)" done; ensure the variable and substitutions
(VLLM_SERVER_URL and the curl command) are wrapped in double quotes to be
defensive.

Comment on lines +966 to +972
fi
# Log progress every 60 seconds
if [ $((ATTEMPT % (60 / HEALTH_CHECK_INTERVAL))) -eq 0 ]; then
ELAPSED=$(($(date +%s) - START_TIME))
echo " ... still waiting for server $GLOBAL_IDX (${{ELAPSED}}s elapsed)"
fi
sleep $HEALTH_CHECK_INTERVAL

⚠️ Potential issue | 🟡 Minor

Division-by-zero risk if health_check_interval ≥ 60.

Line 968: ATTEMPT % (60 / HEALTH_CHECK_INTERVAL) — bash integer division truncates, so if HEALTH_CHECK_INTERVAL > 60, 60 / HEALTH_CHECK_INTERVAL evaluates to 0, causing % 0 which is a bash arithmetic error. The default of 10 is safe, but this is a latent issue if someone configures a large interval.

🛡️ Suggested fix
-            if [ $((ATTEMPT % (60 / HEALTH_CHECK_INTERVAL))) -eq 0 ]; then
+            LOG_EVERY=$((60 / HEALTH_CHECK_INTERVAL))
+            if [ "$LOG_EVERY" -le 0 ]; then LOG_EVERY=1; fi
+            if [ $((ATTEMPT % LOG_EVERY)) -eq 0 ]; then
📝 Committable suggestion


Suggested change
fi
# Log progress every 60 seconds
if [ $((ATTEMPT % (60 / HEALTH_CHECK_INTERVAL))) -eq 0 ]; then
ELAPSED=$(($(date +%s) - START_TIME))
echo " ... still waiting for server $GLOBAL_IDX (${{ELAPSED}}s elapsed)"
fi
sleep $HEALTH_CHECK_INTERVAL
fi
# Log progress every 60 seconds
LOG_EVERY=$((60 / HEALTH_CHECK_INTERVAL))
if [ "$LOG_EVERY" -le 0 ]; then LOG_EVERY=1; fi
if [ $((ATTEMPT % LOG_EVERY)) -eq 0 ]; then
ELAPSED=$(($(date +%s) - START_TIME))
echo " ... still waiting for server $GLOBAL_IDX (${{ELAPSED}}s elapsed)"
fi
sleep $HEALTH_CHECK_INTERVAL
🤖 Prompt for AI Agents
In `@nemo_skills/pipeline/utils/scripts.py` around lines 966 - 972, The modulo
expression ATTEMPT % (60 / HEALTH_CHECK_INTERVAL) can divide by zero when
HEALTH_CHECK_INTERVAL >= 60; fix by computing a safe divisor first (e.g.
LOG_EVERY = 60 / HEALTH_CHECK_INTERVAL and if LOG_EVERY < 1 set LOG_EVERY=1) and
then use ATTEMPT % LOG_EVERY for the logging condition; update the block using
the variables ATTEMPT, HEALTH_CHECK_INTERVAL, LOG_EVERY (or similar) and keep
the existing ELAPSED and echo logic unchanged.

- Fix PYTHONPATH stripping to match /nemo_run/code prefixed paths
- Re-enable pipefail before ng_collect_rollouts
- Fix indentation in GymClientScript error block
- Remove emoji from server ready message

Signed-off-by: George Armstrong <georgea@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/pipeline/utils/declarative.py (1)

778-802: ⚠️ Potential issue | 🟠 Major

Bug: external_deps is not passed to the first executor when reordering occurs.

After the sort at line 754, prepared_commands is reordered so spanning components appear first (improving SLURM allocation). However, the loop at line 762 checks het_idx == 0 and comp_idx == 0 using the original component indices stored in each entry. If the spanning component that now appears first was originally at comp_idx=1, the condition fails and no executor receives external_deps.

The code already acknowledges this reordering issue at lines 757–759 for shared_packager, but the same problem applies to external_deps. Since nemo-run uses the first executor to determine the SLURM allocation (line 738), the fix is to pass dependencies to the first executor in iteration order:

-        for entry in prepared_commands:
+        for entry_idx, entry in enumerate(prepared_commands):
             het_idx = entry["het_idx"]
             comp_idx = entry["comp_idx"]
             ...
-            exec_dependencies = external_deps if (het_idx == 0 and comp_idx == 0) else None
+            exec_dependencies = external_deps if entry_idx == 0 else None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/declarative.py` around lines 778 - 802,
prepared_commands is reordered so relying on stored het_idx/comp_idx to decide
who gets external_deps is incorrect; instead, determine the "first executor" by
iteration order and pass external_deps to that executor. Modify the loop that
builds executors (the block that calls _create_executor) to compute
exec_dependencies based on the iteration position (e.g., using an iteration
counter or a boolean flag) rather than the stored het_idx/comp_idx, ensuring the
very first created executor receives external_deps; update any existing
shared_packager handling consistently so only the first yielded executor gets
external_deps.
🧹 Nitpick comments (4)
nemo_skills/pipeline/utils/declarative.py (1)

638-644: Use direct dictionary access for exec_config["mounts"].

exec_config["mounts"] is always set at line 302, so .get() is unnecessary. Per coding guidelines, use direct access to fail fast if the key is ever unexpectedly missing.

Suggested fix
-        extra_mounts = exec_config.get("mounts") or None
+        extra_mounts = exec_config["mounts"] or None

As per coding guidelines, "Do not use .get() for accessing dictionary keys if the code expects them to be present; use direct dictionary access dict[key] instead to allow proper error handling and fail fast with clear errors".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/declarative.py` around lines 638 - 644, Replace
the use of exec_config.get("mounts") with direct dictionary access
exec_config["mounts"] in the mounts merge block so the code fails fast if the
key is missing; specifically, in the block that computes mounts (using variables
mounts, extra_mounts and calling get_mounts_from_config(cluster_config)), set
extra_mounts = exec_config["mounts"] (keeping the subsequent truthy check and
merge logic unchanged) to follow the guideline of direct access when the key is
expected to exist.
nemo_skills/pipeline/utils/scripts.py (3)

518-535: extra_arguments is forwarded verbatim to both ng_run and ng_collect_rollouts.

If a user passes Hydra overrides that are only valid for one of the two commands, the other will fail with an unknown-override error. Consider splitting into separate parameters (e.g., extra_ng_run_args / extra_ng_collect_args) or at least document the coupling more prominently so callers know every override must be understood by both commands.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/scripts.py` around lines 518 - 535,
extra_arguments is forwarded verbatim to both ng_run and ng_collect_rollouts
causing unknown-override failures; split or separate the args so each command
only receives relevant overrides. Change the implementation to accept two
distinct parameters (e.g., extra_ng_run_args and extra_ng_collect_args) and use
extra_ng_run_args when appending to ng_run_parts and building ng_run_cmd, and
use extra_ng_collect_args when appending to ng_collect_parts and building
ng_collect_cmd; update constructor/attribute names and any call sites that set
extra_arguments to pass the appropriate new parameter(s), or alternatively
update documentation to clearly state that extra_arguments must be valid for
both commands if you choose not to split.

639-805: MultiVLLMServerScript — solid design for data-parallel vLLM.

Clean use of SLURM env vars for per-task GPU slicing and port assignment. A few observations:

  1. --trust-remote-code is unconditionally hardcoded (line 800). If a user loads a model from a trusted local checkpoint and wants to avoid this flag, there's no way to opt out. Consider making it configurable or letting users control it via server_args.
  2. The port range (base_port to base_port + total_replicas - 1) is allocated once at init time. If any of those ports are already in use on a target node, the server will fail at runtime. This is inherent to the approach but worth noting in the docstring.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/scripts.py` around lines 639 - 805, The script
unconditionally injects --trust-remote-code in MultiVLLMServerScript which
prevents users from opting out; add a boolean field (e.g., trust_remote_code:
bool = True) to the dataclass and conditionally include the
"--trust-remote-code" token in the generated cmd only when that flag is true (or
let users pass it via server_args by default). Update the __post_init__ build
logic that constructs cmd to reference this new field when composing the python3
-m vllm.entrypoints.openai.api_server command, and update the class docstring to
note the port allocation caveat (base_port .. base_port + _total_replicas - 1)
so users know ports may be in-use on target nodes.

625-633: hostname_ref() in env vars won't resolve in heterogeneous jobs.

Line 630 sets NEMO_SKILLS_SANDBOX_HOST to self.sandbox.hostname_ref(). For non-het jobs this returns "127.0.0.1" (fine), but for het jobs it returns a bash expression like ${SLURM_MASTER_NODE_HET_GROUP_0:-localhost}. Since env vars set via nemo-run are literal values (not shell-evaluated), OmegaConf/Gym would receive the unexpanded string.

This isn't an issue for the current single-group use case, but will silently break if someone uses this script in a heterogeneous job. Consider adding a guard or a comment.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/scripts.py` around lines 625 - 633, The code sets
NEMO_SKILLS_SANDBOX_HOST to self.sandbox.hostname_ref() which can be a bash
expression for heterogeneous (het) SLURM jobs and will be passed literally;
change the logic around env_vars population so you detect bash-expression
patterns (e.g. contains "${" or starts with "$") from
self.sandbox.hostname_ref() and avoid injecting an unexpanded literal into
env_vars["NEMO_SKILLS_SANDBOX_HOST"] — instead either (a) set a safe fallback
like "localhost"/resolved IP, (b) omit the env var so the target process can
resolve itself, and (c) add a warning log entry; keep
env_vars["NEMO_SKILLS_SANDBOX_PORT"] as-is. Update the code around
self.sandbox.hostname_ref(), self.sandbox.port and the return that builds
{"environment": env_vars} and add a brief comment noting het-only hostname
expressions must not be passed literally.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/pipeline/utils/scripts.py`:
- Around line 590-605: The health check currently relies on a brittle string
match of STATUS_OUTPUT for "healthy, 0 unhealthy" (in the ng_status loop) which
can break if ng_status output changes; update the loop to use a more robust
signal: prefer ng_status's exit code or a structured output (JSON) if available,
or parse numeric healthy/unhealthy counts with a regex that extracts both
numbers from STATUS_OUTPUT and checks unhealthy == 0; also add a timeout/counter
to the loop to prevent an infinite hang, referencing the ng_status call,
STATUS_OUTPUT, CURRENT_STATUS and LAST_STATUS variables for where to implement
the changes.

---

Outside diff comments:
In `@nemo_skills/pipeline/utils/declarative.py`:
- Around line 778-802: prepared_commands is reordered so relying on stored
het_idx/comp_idx to decide who gets external_deps is incorrect; instead,
determine the "first executor" by iteration order and pass external_deps to that
executor. Modify the loop that builds executors (the block that calls
_create_executor) to compute exec_dependencies based on the iteration position
(e.g., using an iteration counter or a boolean flag) rather than the stored
het_idx/comp_idx, ensuring the very first created executor receives
external_deps; update any existing shared_packager handling consistently so only
the first yielded executor gets external_deps.
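The iteration-order fix above can be sketched with a simplified stand-in for the `_create_executor` loop (the dict shape here is hypothetical, not the real executor type):

```python
def build_executors(prepared_commands, external_deps):
    """Attach external_deps to the first executor in iteration order,
    not to the entry whose stored comp_idx happens to be 0."""
    executors = []
    for entry_idx, command in enumerate(prepared_commands):
        # Only the very first created executor receives the upstream deps,
        # even after prepared_commands has been re-sorted.
        deps = list(external_deps) if entry_idx == 0 else []
        executors.append({"command": command, "dependencies": deps})
    return executors
```

With this shape, a `run_after` job always waits on the leading executor regardless of how the commands were reordered.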

---

Duplicate comments:
In `@nemo_skills/pipeline/utils/scripts.py`:
- Around line 565-572: The health-check loop uses unquoted command substitution
for the curl exit code; update the while condition in the VLLM_SERVER_URL block
so the $(curl ...) expansion is wrapped in double quotes to prevent
word-splitting/empty-string issues (i.e., quote the command substitution used in
the while test that queries "$VLLM_SERVER_URL/models" so the comparison against
200 is reliable).
- Around line 933-939: The error block around the BEST_COUNT check has
consistent 4-space indentation for the echo lines but the review contains a
duplicate comment marker; remove the duplicate marker and verify the conditional
block (if [ "$BEST_COUNT" -lt "$NUM_NODES" ]; then ... exit 1) keeps the echo
lines for BEST_VAR_NAME, BEST_NODES_STR, SLURM_JOB_NODELIST, SLURM_NODELIST, and
SLURM_STEP_NODELIST indented consistently and that no stray duplicate comments
remain in scripts.py.
- Around line 963-976: The modulo expression ATTEMPT % (60 /
HEALTH_CHECK_INTERVAL) can divide by zero when HEALTH_CHECK_INTERVAL >= 60; fix
by computing a safe log-frequency before the loop (e.g. LOG_EVERY=$((60 /
HEALTH_CHECK_INTERVAL)); if [ "$LOG_EVERY" -le 0 ]; then LOG_EVERY=1; fi) and
then replace the existing check with if [ $((ATTEMPT % LOG_EVERY)) -eq 0 ]; then
...; use the same variables ATTEMPT, HEALTH_CHECK_INTERVAL, LOG_EVERY, ELAPSED
and GLOBAL_IDX to locate and update the code inside the while true loop.
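The quoting fix in the first duplicate item above can be demonstrated with a stand-in for the curl probe; `http_code` here is hypothetical and simply returns an empty string, the way curl's `%{http_code}` write-out can when the server is unreachable:

```shell
# Stand-in for: curl -s -o /dev/null -w "%{http_code}" "$VLLM_SERVER_URL/models"
http_code() {
    echo ""
}

# Quoted substitution: an empty result still compares cleanly against 200.
# Unquoted ([ $(http_code) != "200" ]) would expand to [ != "200" ] and
# error out instead of evaluating the test.
if [ "$(http_code)" != "200" ]; then
    echo "server not ready"
fi
```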
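The divide-by-zero guard from the last duplicate item is small enough to sketch whole:

```shell
HEALTH_CHECK_INTERVAL=120   # any value >= 60 makes 60/interval truncate to 0
LOG_EVERY=$((60 / HEALTH_CHECK_INTERVAL))
if [ "$LOG_EVERY" -le 0 ]; then
    LOG_EVERY=1
fi

# Safe inside the wait loop: ATTEMPT % LOG_EVERY can no longer divide by zero.
ATTEMPT=7
if [ $((ATTEMPT % LOG_EVERY)) -eq 0 ]; then
    echo "still waiting (attempt $ATTEMPT)"
fi
```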

---

Nitpick comments:
In `@nemo_skills/pipeline/utils/declarative.py`:
- Around line 638-644: Replace the use of exec_config.get("mounts") with direct
dictionary access exec_config["mounts"] in the mounts merge block so the code
fails fast if the key is missing; specifically, in the block that computes
mounts (using variables mounts, extra_mounts and calling
get_mounts_from_config(cluster_config)), set extra_mounts =
exec_config["mounts"] (keeping the subsequent truthy check and merge logic
unchanged) to follow the guideline of direct access when the key is expected to
exist.

In `@nemo_skills/pipeline/utils/scripts.py`:
- Around line 518-535: extra_arguments is forwarded verbatim to both ng_run and
ng_collect_rollouts causing unknown-override failures; split or separate the
args so each command only receives relevant overrides. Change the implementation
to accept two distinct parameters (e.g., extra_ng_run_args and
extra_ng_collect_args) and use extra_ng_run_args when appending to ng_run_parts
and building ng_run_cmd, and use extra_ng_collect_args when appending to
ng_collect_parts and building ng_collect_cmd; update constructor/attribute names
and any call sites that set extra_arguments to pass the appropriate new
parameter(s), or alternatively update documentation to clearly state that
extra_arguments must be valid for both commands if you choose not to split.
- Around line 639-805: The script unconditionally injects --trust-remote-code in
MultiVLLMServerScript which prevents users from opting out; add a boolean field
(e.g., trust_remote_code: bool = True) to the dataclass and conditionally
include the "--trust-remote-code" token in the generated cmd only when that flag
is true (or let users pass it via server_args by default). Update the
__post_init__ build logic that constructs cmd to reference this new field when
composing the python3 -m vllm.entrypoints.openai.api_server command, and update
the class docstring to note the port allocation caveat (base_port .. base_port +
_total_replicas - 1) so users know ports may be in-use on target nodes.
- Around line 625-633: The code sets NEMO_SKILLS_SANDBOX_HOST to
self.sandbox.hostname_ref() which can be a bash expression for heterogeneous
(het) SLURM jobs and will be passed literally; change the logic around env_vars
population so you detect bash-expression patterns (e.g. contains "${" or starts
with "$") from self.sandbox.hostname_ref() and avoid injecting an unexpanded
literal into env_vars["NEMO_SKILLS_SANDBOX_HOST"] — instead either (a) set a
safe fallback like "localhost"/resolved IP, (b) omit the env var so the target
process can resolve itself, and (c) add a warning log entry; keep
env_vars["NEMO_SKILLS_SANDBOX_PORT"] as-is. Update the code around
self.sandbox.hostname_ref(), self.sandbox.port and the return that builds
{"environment": env_vars} and add a brief comment noting het-only hostname
expressions must not be passed literally.
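The extra_arguments split suggested in the first nitpick above could look like this minimal sketch (the field names mirror the review text, but the class is a hypothetical simplification of NemoGymRolloutsScript; note the real classes use `kw_only=True`):

```python
from dataclasses import dataclass


@dataclass
class RolloutCommands:
    extra_ng_run_args: str = ""
    extra_ng_collect_args: str = ""

    def build(self):
        # Each command only receives the overrides that are valid for it,
        # avoiding unknown-override failures from forwarding one string to both.
        ng_run_cmd = f"ng_run {self.extra_ng_run_args}".strip()
        ng_collect_cmd = f"ng_collect_rollouts {self.extra_ng_collect_args}".strip()
        return ng_run_cmd, ng_collect_cmd
```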
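The het-hostname guard from the last nitpick could be implemented as follows (a sketch; the `"${"` detection heuristic is the one the review proposes, and the function name is hypothetical):

```python
import logging

LOG = logging.getLogger(__name__)


def sandbox_env_vars(hostname_ref, port):
    # Het-only hostname expressions like "${het_group_host_1}" must not be
    # passed literally; omit the host var so the target resolves it itself.
    env_vars = {"NEMO_SKILLS_SANDBOX_PORT": str(port)}
    if "${" in hostname_ref or hostname_ref.startswith("$"):
        LOG.warning(
            "Sandbox hostname %r is a bash expression; not exporting it", hostname_ref
        )
    else:
        env_vars["NEMO_SKILLS_SANDBOX_HOST"] = hostname_ref
    return env_vars
```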

Comment on lines +590 to +605
STATUS_OUTPUT=$(ng_status 2>&1)

if echo "$STATUS_OUTPUT" | grep -q "healthy, 0 unhealthy"; then
    echo "All servers ready!"
    break
fi

# Only print status when it changes (reduce verbosity)
CURRENT_STATUS=$(echo "$STATUS_OUTPUT" | grep -oE '[0-9]+ healthy' | head -1 || echo "starting")
if [ "$CURRENT_STATUS" != "$LAST_STATUS" ]; then
    echo "Server status: $CURRENT_STATUS"
    LAST_STATUS="$CURRENT_STATUS"
fi

sleep 10
done
Contributor

⚠️ Potential issue | 🟡 Minor

Fragile health-check string match: "healthy, 0 unhealthy".

If ng_status changes its output format (e.g., wording, ordering, locale), the grep -q "healthy, 0 unhealthy" at line 592 will never match and the loop will spin forever (silent hang with no timeout). Consider using a more structured health-check mechanism (e.g., exit code of ng_status, or a JSON output field) if one is available.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/scripts.py` around lines 590 - 605, The health
check currently relies on a brittle string match of STATUS_OUTPUT for "healthy,
0 unhealthy" (in the ng_status loop) which can break if ng_status output
changes; update the loop to use a more robust signal: prefer ng_status's exit
code or a structured output (JSON) if available, or parse numeric
healthy/unhealthy counts with a regex that extracts both numbers from
STATUS_OUTPUT and checks unhealthy == 0; also add a timeout/counter to the loop
to prevent an infinite hang, referencing the ng_status call, STATUS_OUTPUT,
CURRENT_STATUS and LAST_STATUS variables for where to implement the changes.

@cmunley1

Curious if this will support resuming rollouts, like with a .done file?

@gwarmstrong
Collaborator Author

Curious if this will support resuming rollouts, like with a .done file?

@cmunley1 no concrete plans at the moment; it would be better if Gym supported resuming natively.

@cmunley1

I'm also curious whether this would/could enable end-to-end reward profiling across many parallel SLURM jobs.

i.e., say I want to run 16 rollouts per prompt, for 20,000 prompts, across a few hundred SLURM jobs, then aggregate the results, so I have avg/stddev reward per task for filtering or curriculum.

We have a small feature in NeMo Gym to aggregate results, but chunking across SLURM jobs is not supported in Gym natively; it feels more suited here, since ns can launch SLURM jobs.

NVIDIA-NeMo/Gym#621

This is a common workflow that currently requires custom scripts in NeMo Gym to chunk, submit parallel jobs, and merge results.
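The aggregation step of that workflow can be sketched independently of the chunking; the record field names below ("task_id", "reward") are hypothetical, not Gym's actual schema:

```python
import statistics
from collections import defaultdict


def aggregate_rewards(records):
    """Compute avg/stddev reward per task from merged rollout records,
    e.g. after concatenating per-chunk rollout output files."""
    per_task = defaultdict(list)
    for rec in records:
        per_task[rec["task_id"]].append(rec["reward"])
    return {
        task: {
            "mean": statistics.mean(rewards),
            "stdev": statistics.pstdev(rewards) if len(rewards) > 1 else 0.0,
        }
        for task, rewards in per_task.items()
    }
```

The per-task mean/stdev can then drive filtering or curriculum ordering as described above.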

@gwarmstrong
Collaborator Author

@cmunley1 we do something similar for ns generate already; we can probably extend it to gym rollouts, but I might do it as a follow-up PR

@Kipok Kipok left a comment
Added a few comments. Here is another one from Codex; not fully sure if it's meaningful or not, but worth checking:

Sorting prepared_commands changes executor order, but external dependencies are still attached based on the original comp_idx == 0 check later in the loop. When a non-spanning command was originally first and a spanning command is reordered ahead of it, the dependency can be attached to a non-leading executor, which can let a run_after job start before its intended upstream dependency.



@dataclass(kw_only=True)
class NemoGymRolloutsScript(BaseJobScript):
Collaborator
might be a good time to make scripts a subfolder and split this into different files

Collaborator Author
fair, let's do that



@dataclass(kw_only=True)
class MultiVLLMServerScript(BaseJobScript):
Collaborator
I'm a bit confused about why we need this and the class below. vLLM should naturally support data parallelism with a --data-parallel-size argument. Why can't we just use that with our normal vLLM server? Then there is a single IP Gym talks to, but the data split is still handled, just on the server side?

Collaborator Author
Yeah, that was definitely the preferred route, but I personally found --data-parallel-size to be error-prone and not to work very well past a couple of nodes. I thought it might be useful to share, but it is unused in the main pipeline at the moment, so I should probably just remove it.

cmd = f"""set -e
set -o pipefail

echo "=== Installing NeMo Gym ==="
Collaborator

why do we need to install gym? It should be preinstalled in the container, right?

Collaborator Author

This is needed when users mount a custom Gym path (e.g., dev branch) over the container's built-in path. The mounted directory may not have a .venv. Added a comment explaining this. The --allow-existing flag makes it fast (~1s) when the venv is already up to date.

# Wait for vLLM server to be ready before starting ng_run
# Note: --kill-on-bad-exit in srun ensures job fails if vLLM crashes
VLLM_SERVER_URL="{vllm_server_url}"
if [ -n "$VLLM_SERVER_URL" ]; then
Collaborator

can we reuse get_server_wait_cmd here?

Collaborator Author

Done in 402715f. Replaced with get_server_wait_cmd() from nemo_skills.utils.

        "dummy",
        help="API key for policy server. Use 'dummy' for local vLLM servers.",
    ),
    policy_model_name: str = typer.Option(
Collaborator

can we just reuse the "model" parameter here? That would be consistent with how we do that in main generate pipeline

Collaborator Author

I'm not sure if what you're saying is completely consistent with ns generate. But I aligned the behavior here with ns generate in 12169c5. --model is the model path/name, --server_address is the pre-hosted URL, and --server_gpus is the discriminator (truthy = self-hosted, falsy = pre-hosted). This matches configure_client() in generation.py where if server_gpus: determines hosting mode.
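The discriminator described in that reply can be sketched as follows (a simplified stand-in; the real logic lives in configure_client() in generation.py, and the function name here is hypothetical):

```python
def hosting_mode(server_gpus, server_address):
    # Truthy server_gpus => self-hosted; otherwise a pre-hosted URL is required.
    if server_gpus:
        return "self-hosted"
    if server_address is None:
        raise ValueError("--server_address is required when --server_gpus is not set")
    return "pre-hosted"
```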

if random_seeds is not None:
    # Explicit seeds provided
    if isinstance(random_seeds, str):
        seed_indices = [int(s.strip()) for s in random_seeds.split(",")]
Collaborator

can we reuse str_ids_to_list from main generate pipeline for consistency?

Collaborator Author

Done in 402715f. Using str_ids_to_list() now.

        skipped_seeds.append(seed_idx)
    else:
        filtered_seeds.append(seed_idx)
except Exception as e:
Collaborator

Do we expect this to fail? Why a blank exception?

Collaborator Author

Done in 402715f. Removed the try/except — cluster check failures now propagate immediately.

- Split scripts.py into scripts/ package (base, server, generation, nemo_gym)
- Remove unused MultiVLLMServerScript and GymClientScript classes
- Remove server_address param; model now accepts URLs for pre-hosted servers
- Use str_ids_to_list for seed parsing (consistency with generate pipeline)
- Add server_container CLI option; use direct dict access for container lookup
- Remove try/except on seed skip check — fail hard on cluster errors
- Use get_server_wait_cmd for vLLM health check (reuse existing utility)
- Add comment explaining why Gym venv install step is needed for mounted paths
- Fix external_deps assignment after prepared_commands reordering
- Replace .get() with direct dict access throughout

Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Keep --server_address as separate param (matching generate pipeline).
Use server_gpus as the discriminator: truthy = self-hosted, falsy = pre-hosted.
This matches configure_client() in generation.py.

Signed-off-by: George Armstrong <georgea@nvidia.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

🧹 Nitpick comments (1)
nemo_skills/pipeline/utils/scripts/server.py (1)

136-142: Use direct key access for required metadata fields.

metadata["environment"] is expected by contract here; using .get() can silently hide shape regressions.

♻️ Proposed fix
 def build_cmd() -> Tuple[str, Dict]:
-    env = dict(metadata.get("environment", {}))
+    env = dict(metadata["environment"])
     if self.env_overrides:
         for override in self.env_overrides:
             key, value = override.split("=", 1)
             env[key] = value
As per coding guidelines: "Don't use `.get()` for accessing dictionary keys if the code expects them to be present; use direct access `data[key_name]` to fail with a clear error instead of silently corrupting data".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/scripts/server.py` around lines 136 - 142, The
code in build_cmd uses metadata.get("environment", {}) which can hide missing
required fields; change this to directly access metadata["environment"] (e.g.,
env = dict(metadata["environment"])) so a KeyError is raised if the environment
key is absent, then apply self.env_overrides as before and return cmd,
{"environment": env}; reference build_cmd, metadata, and self.env_overrides when
making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/pipeline/nemo_gym_rollouts.py`:
- Around line 188-202: When model is a pre-hosted URL (pre_hosted == True),
reject any self-hosted-only flags instead of silently ignoring them: detect if
any of the parameters server_type, server_gpus, server_nodes, server_args, or
server_container are non-None/used while pre_hosted is True and raise a
ValueError stating those flags are invalid for pre-hosted mode; update the
initial validation block around pre_hosted/self_hosted (and the similar
validation later in the same file where server_* flags are checked) to perform
this check and include the offending flag names in the error message so users
fail fast on misconfiguration.
- Around line 221-233: The seed-selection logic around variables random_seeds,
num_random_seeds, seed_indices and helper str_ids_to_list needs validation:
explicitly validate num_random_seeds is an int > 0 (reject or raise/log error
for 0 or negative values) instead of relying on truthiness, and when parsing
random_seeds ensure you convert entries to ints, deduplicate them (preserve
order) to avoid duplicate job/output names, and reject/normalize
invalid/non-negative seeds; apply the same validation/deduplication fix to the
other occurrences mentioned (the blocks using seed_indices at lines referred to)
so all job-generation code uses the validated, deduplicated list or errors out
with a clear message.

In `@nemo_skills/pipeline/utils/scripts/generation.py`:
- Around line 86-94: The current construction of server_addresses can ignore or
misindex pre-hosted addresses because server_addresses_prehosted is only used
when self.servers is present and there are no length checks; update the logic
that builds server_addresses (the block using self.servers,
server_script.hostname_ref(), server_script.port,
self.server_addresses_prehosted and enumerate) to validate inputs: enforce that
self.servers and self.server_addresses_prehosted are mutually exclusive or
explicitly allowed, check that when both are intended to be used their lengths
match, and guard the access to self.server_addresses_prehosted[server_idx] with
an index check — otherwise raise a clear ValueError indicating the mismatch or
unsupported combination so caller errors rather than producing opaque failures.

In `@nemo_skills/pipeline/utils/scripts/nemo_gym.py`:
- Around line 62-80: The __post_init__ currently silently ignores server_address
when server is also set; update the initialization to validate inputs and fail
fast: in __post_init__ (before build_cmd is used) check if both server and
server_address are provided and raise a ValueError (or TypeError) with a clear
message, referencing the conflicting fields server and server_address so callers
know to pass only one; keep the existing build_cmd logic unchanged but rely on
this early validation to prevent silent shadowing when building ng_run_parts and
appending '+policy_base_url=...'.

In `@nemo_skills/pipeline/utils/scripts/server.py`:
- Around line 126-134: The __post_init__ currently allows calling
sandbox_command with port=None when allocate_port is False, causing invalid
runtime env and delayed failures; update __post_init__ (in the class that
defines allocate_port and port) to validate that if allocate_port is False then
port is not None and raise a clear ValueError (or similar) immediately,
otherwise if allocate_port is True keep allocating via
get_free_port(strategy="random") and proceed to call
sandbox_command(cluster_config=self.cluster_config, port=self.port); this
ensures failures are fast and surfaces the missing-port configuration early.
- Around line 79-93: In __post_init__, add a fail-fast check that raises a clear
exception if allocate_port is False and port is None: if not self.allocate_port
and self.port is None raise a ValueError (or custom exception) with a message
like "port must be provided when allocate_port is False" before calling
get_server_command; this ensures callers of server.__post_init__ (and the
get_server_command call) don't proceed with a None server_port.
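Both port comments above amount to the same fail-fast guard, which could look like this (a hypothetical slice of the ServerScript/SandboxScript dataclasses; get_free_port is stubbed, and the real classes use `kw_only=True`):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PortedScript:
    allocate_port: bool = True
    port: Optional[int] = None

    def __post_init__(self):
        # Fail fast instead of passing port=None further down the stack.
        if not self.allocate_port and self.port is None:
            raise ValueError("port must be provided when allocate_port is False")
        if self.allocate_port and self.port is None:
            self.port = 5000  # stand-in for get_free_port(strategy="random")
```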

---

Nitpick comments:
In `@nemo_skills/pipeline/utils/scripts/server.py`:
- Around line 136-142: The code in build_cmd uses metadata.get("environment",
{}) which can hide missing required fields; change this to directly access
metadata["environment"] (e.g., env = dict(metadata["environment"])) so a
KeyError is raised if the environment key is absent, then apply
self.env_overrides as before and return cmd, {"environment": env}; reference
build_cmd, metadata, and self.env_overrides when making the change.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 03f3ee7 and aa2320e.

📒 Files selected for processing (11)
  • docs/basics/code-packaging.md
  • nemo_skills/pipeline/nemo_gym_rollouts.py
  • nemo_skills/pipeline/utils/__init__.py
  • nemo_skills/pipeline/utils/declarative.py
  • nemo_skills/pipeline/utils/scripts.py
  • nemo_skills/pipeline/utils/scripts/__init__.py
  • nemo_skills/pipeline/utils/scripts/base.py
  • nemo_skills/pipeline/utils/scripts/generation.py
  • nemo_skills/pipeline/utils/scripts/nemo_gym.py
  • nemo_skills/pipeline/utils/scripts/server.py
  • tests/gpu-tests/test_nemo_gym_rollouts.py
💤 Files with no reviewable changes (1)
  • nemo_skills/pipeline/utils/scripts.py

Comment on lines +86 to +94
server_addresses = None
if self.servers is not None:
    server_addresses = []
    for server_idx, server_script in enumerate(self.servers):
        if server_script is not None:
            addr = f"{server_script.hostname_ref()}:{server_script.port}"
        else:
            addr = self.server_addresses_prehosted[server_idx]
        server_addresses.append(addr)
Contributor

⚠️ Potential issue | 🟠 Major

Pre-hosted server addresses can be ignored or misindexed.

Right now, pre-hosted addresses are only used when servers is provided, and missing length checks can produce opaque failures.

🔧 Proposed fix
             server_addresses = None
             if self.servers is not None:
+                if any(server is None for server in self.servers):
+                    if self.server_addresses_prehosted is None:
+                        raise ValueError(
+                            "server_addresses_prehosted is required when `servers` contains None entries."
+                        )
+                    if len(self.server_addresses_prehosted) != len(self.servers):
+                        raise ValueError("server_addresses_prehosted must match the length of `servers`.")
                 server_addresses = []
                 for server_idx, server_script in enumerate(self.servers):
                     if server_script is not None:
                         addr = f"{server_script.hostname_ref()}:{server_script.port}"
                     else:
                         addr = self.server_addresses_prehosted[server_idx]
                     server_addresses.append(addr)
+            elif self.server_addresses_prehosted is not None:
+                server_addresses = list(self.server_addresses_prehosted)
As per coding guidelines: "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing. Use dataclass or **kwargs syntax to handle this automatically".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/scripts/generation.py` around lines 86 - 94, The
current construction of server_addresses can ignore or misindex pre-hosted
addresses because server_addresses_prehosted is only used when self.servers is
present and there are no length checks; update the logic that builds
server_addresses (the block using self.servers, server_script.hostname_ref(),
server_script.port, self.server_addresses_prehosted and enumerate) to validate
inputs: enforce that self.servers and self.server_addresses_prehosted are
mutually exclusive or explicitly allowed, check that when both are intended to
be used their lengths match, and guard the access to
self.server_addresses_prehosted[server_idx] with an index check — otherwise
raise a clear ValueError indicating the mismatch or unsupported combination so
caller errors rather than producing opaque failures.

@gwarmstrong
Collaborator Author

Here is another one from codex, not fully sure if it's meaningful or not, but worth checking
Sorting prepared_commands changes executor order, but external dependencies are still attached based on the original comp_idx == 0 check later in the loop.

Good catch. Done in 402715f. Changed to use entry_idx == 0 instead of comp_idx == 0 so the first executor in iteration order always gets the dependency.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
nemo_skills/pipeline/nemo_gym_rollouts.py (2)

221-233: ⚠️ Potential issue | 🟠 Major

Validate seed inputs explicitly to avoid empty runs and output collisions.

elif num_random_seeds: relies on truthiness (0 falls through; negatives can create empty ranges), and explicit seed lists are not validated for duplicates/non-negative values. This can lead to duplicate job/output names downstream (rollouts-rs{seed}.jsonl) or silently creating zero jobs.

Suggested fix
     if random_seeds is not None:
         if isinstance(random_seeds, str):
             seed_indices = str_ids_to_list(random_seeds)
         else:
             seed_indices = list(random_seeds)
+        if not seed_indices:
+            raise ValueError("--random_seeds must contain at least one seed")
+        if any(seed < 0 for seed in seed_indices):
+            raise ValueError("--random_seeds must contain non-negative integers")
+        if len(seed_indices) != len(set(seed_indices)):
+            raise ValueError("--random_seeds must not contain duplicate values")
         LOG.info(f"Using explicit seeds: {seed_indices}")
-    elif num_random_seeds:
+    elif num_random_seeds is not None:
+        if num_random_seeds <= 0:
+            raise ValueError("--num_random_seeds must be > 0")
+        if starting_seed < 0:
+            raise ValueError("--starting_seed must be >= 0")
         seed_indices = list(range(starting_seed, starting_seed + num_random_seeds))
         LOG.info(
             f"Creating {num_random_seeds} separate jobs (rs{starting_seed}..rs{starting_seed + num_random_seeds - 1})"
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/nemo_gym_rollouts.py` around lines 221 - 233, The
seed-selection logic around random_seeds/num_random_seeds should validate
inputs: ensure num_random_seeds is an int >= 1 (do not rely on truthiness) and
starting_seed is a non-negative int before constructing seed_indices, and when
random_seeds is provided (handled by str_ids_to_list/random_seeds branch)
convert elements to ints, check they are non-negative and deduplicate (or raise
ValueError) to avoid duplicate job names and empty ranges; update the block that
sets seed_indices (using random_seeds, str_ids_to_list, num_random_seeds,
starting_seed) to perform these checks and raise/return a clear error if
validation fails instead of silently producing empty or duplicate seed lists.

189-204: ⚠️ Potential issue | 🟠 Major

Reject mode-incompatible flags instead of silently ignoring them.

In pre-hosted mode, self-hosted-only flags (--server_type, --server_gpus, --server_nodes, --server_args, --server_container) are currently accepted but ignored; similarly, --server_address should be rejected in self-hosted mode. Fail fast here to prevent silent misconfiguration.

Suggested fix
     if pre_hosted and server_address is None:
         raise ValueError("--server_address is required when not using self-hosted server (no --server_gpus)")
+
+    if pre_hosted:
+        invalid_flags = []
+        if server_type is not None:
+            invalid_flags.append("--server_type")
+        if server_gpus is not None:
+            invalid_flags.append("--server_gpus")
+        if server_nodes != 1:
+            invalid_flags.append("--server_nodes")
+        if server_args:
+            invalid_flags.append("--server_args")
+        if server_container is not None:
+            invalid_flags.append("--server_container")
+        if invalid_flags:
+            raise ValueError(f"{', '.join(invalid_flags)} are only valid in self-hosted mode")
+
+    if self_hosted and server_address is not None:
+        raise ValueError("--server_address cannot be used with self-hosted mode")

As per coding guidelines: "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing. Use dataclass or **kwargs syntax to handle this automatically".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/nemo_gym_rollouts.py` around lines 189 - 204, The
current hosting-mode validation accepts mode-incompatible flags silently; update
the checks around self_hosted/pre_hosted (variables self_hosted, pre_hosted) to
reject incompatible arguments: when pre_hosted is true, if any of server_type,
server_gpus, server_nodes, server_args, or server_container are not None, raise
a ValueError naming the offending flags; conversely when self_hosted is true and
server_address is not None, raise a ValueError rejecting --server_address; keep
the existing required-flag checks for model, server_type and policy_model_name
intact but place these new incompatibility checks near the top of the block so
the function fails fast on misconfiguration.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/pipeline/nemo_gym_rollouts.py`:
- Around line 185-187: The parsed config paths list (config_paths_list obtained
from config_paths.split(...)) can be empty after stripping, so add an explicit
fail-fast check right after building config_paths_list: if not
config_paths_list: log an error via LOG.error (or raise a ValueError) with a
clear message like "No valid --config_paths provided; please supply at least one
path" and raise/exit before any job construction or further processing (refer to
the config_paths and config_paths_list variables in nemo_gym_rollouts.py).

---

Duplicate comments:
In `@nemo_skills/pipeline/nemo_gym_rollouts.py`:
- Around line 221-233: The seed-selection logic around
random_seeds/num_random_seeds should validate inputs: ensure num_random_seeds is
an int >= 1 (do not rely on truthiness) and starting_seed is a non-negative int
before constructing seed_indices, and when random_seeds is provided (handled by
str_ids_to_list/random_seeds branch) convert elements to ints, check they are
non-negative and deduplicate (or raise ValueError) to avoid duplicate job names
and empty ranges; update the block that sets seed_indices (using random_seeds,
str_ids_to_list, num_random_seeds, starting_seed) to perform these checks and
raise/return a clear error if validation fails instead of silently producing
empty or duplicate seed lists.
- Around lines 189-204: The current hosting-mode validation silently accepts mode-incompatible flags. Update the checks around `self_hosted`/`pre_hosted` to reject incompatible arguments: when `pre_hosted` is true, raise a `ValueError` naming the offending flags if any of `server_type`, `server_gpus`, `server_nodes`, `server_args`, or `server_container` is not None; conversely, when `self_hosted` is true and `server_address` is not None, raise a `ValueError` rejecting `--server_address`. Keep the existing required-flag checks for `model`, `server_type`, and `policy_model_name` intact, but place the new incompatibility checks near the top of the block so the function fails fast on misconfiguration.
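One way the incompatibility checks could look. This is a sketch under the assumptions in the comment above: the flag names follow the review text, and the standalone function signature is hypothetical (in the real code these checks would sit inline near the top of the validation block).

```python
def validate_hosting_flags(
    pre_hosted: bool,
    server_address=None,
    server_type=None,
    server_gpus=None,
    server_nodes=None,
    server_args=None,
    server_container=None,
) -> None:
    """Reject flags that do not apply to the chosen hosting mode."""
    if pre_hosted:
        # Pre-hosted mode talks to an existing server, so all self-hosting
        # flags are meaningless; name every offending flag in the error.
        offending = {
            "--server_type": server_type,
            "--server_gpus": server_gpus,
            "--server_nodes": server_nodes,
            "--server_args": server_args,
            "--server_container": server_container,
        }
        bad = [name for name, value in offending.items() if value is not None]
        if bad:
            raise ValueError(f"Flags {bad} are incompatible with pre-hosted mode")
    else:
        # Self-hosted mode starts its own server, so a fixed address conflicts.
        if server_address is not None:
            raise ValueError("--server_address is incompatible with self-hosted mode")
```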

ℹ️ Review info

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aa2320e and 12169c5.

📒 Files selected for processing (1)
  • nemo_skills/pipeline/nemo_gym_rollouts.py

Comment on lines +185 to +187
# Parse config paths
config_paths_list = [p.strip() for p in config_paths.split(",") if p.strip()]
LOG.info(f"Config paths: {config_paths_list}")

⚠️ Potential issue | 🟡 Minor

Fail fast when parsed --config_paths is empty.

--config_paths is required, but after parsing/stripping it can still become an empty list (e.g., whitespace/comma-only input). Add an explicit check to raise a clear error before job construction.


- Reject self-hosted flags (server_type, server_args, etc.) in pre-hosted mode
- Validate seed inputs: reject num_random_seeds <= 0 and duplicate seeds
- Reject conflicting server + server_address in NemoGymRolloutsScript
- Fail fast when allocate_port=False and port is None (ServerScript, SandboxScript)

Signed-off-by: George Armstrong <georgea@nvidia.com>
@gwarmstrong gwarmstrong requested a review from Kipok March 4, 2026 20:28
@gwarmstrong gwarmstrong changed the title feat: add NeMo Gym rollouts pipeline and multi-node vLLM support feat: add NeMo Gym rollouts pipeline Mar 4, 2026

@Kipok Kipok left a comment


probably worth running slurm tests as well here
